Synopsis

About

This project fits a machine learning model to the Weight Lifting Exercises (WLE) dataset for the purpose of predicting the type of exercise performed. The data come from barbell lifts performed by 6 participants, with measurements taken by devices such as Jawbone Up, Nike FuelBand, and Fitbit.
More information about the WLE Dataset can be found here. See the section on the weight lifting exercises dataset.

Training Data

The training data contain more than nineteen thousand observations on more than one hundred and fifty variables, and can be downloaded from the link given below.
Training Dataset

This dataset is used to create a model for this project.

Testing Data

The testing data contain 20 observations on the same set of more than one hundred and fifty variables. They are used to evaluate the out-of-sample error of the final model for this project, and can be downloaded from the link given below.
Testing Dataset

Pre-Processing

Reading In The Data

We need to perform a little bit of exploratory data analysis on the training data before modelling it. Let’s read it in first.

url_training <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_testing  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

if(!file.exists('training.csv')){
  download.file(url_training, destfile = 'training.csv')
  download.file(url_testing, destfile = 'testing.csv')
}

training <- read.csv('training.csv')
testing  <- read.csv('testing.csv')

Now that we have our data, it's time to explore it. According to the information provided, the observations were taken by accelerometers placed at various locations, such as the arm and belt. The first seven variables are thus not related to the outcome of interest (the classe variable), so it is better to remove them.

# remove the first 7 columns (identifiers, timestamps, windows) from training
train <- training[ , -c(1:7)]
# convert each predictor column (all but classe) to numeric
for(i in 1:152){
  train[, i] <- as.numeric(train[, i])
}
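For reference, the same conversion can be written without an explicit loop (a sketch with the same effect, assuming classe is the last column):

```r
# convert all predictor columns (everything except the last column, classe)
# to numeric in one step
train[, -ncol(train)] <- lapply(train[, -ncol(train)], as.numeric)
```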

NA Values

A quick inspection shows that some of the variables contain a large proportion of NA values. These need to be removed, as nearly empty columns contribute little to the model.

# remove columns with fewer than 1000 non-NA values
train <- train[, (colSums(!is.na(train)) > 1000)]
# look at the data
library(knitr)
knitr::kable(train[1:5, 1:5], caption = "First 5 Columns are now not empty")
First 5 Columns are now not empty

roll_belt  pitch_belt  yaw_belt  total_accel_belt  gyros_belt_x
     1.41        8.07     -94.4                 3          0.00
     1.41        8.07     -94.4                 3          0.02
     1.42        8.07     -94.4                 3          0.00
     1.48        8.05     -94.4                 3          0.02
     1.48        8.07     -94.4                 3          0.02

Correlated Variables

Highly correlated predictors carry redundant information, so we identify and remove them with caret's findCorrelation().

library(caret)
# compute the correlation matrix of the predictors (classe is column 53)
corMat <- cor(train[, -53])
# identify highly correlated columns to remove
corVars <- findCorrelation(corMat)
# remove the highly correlated variables
train <- train[, -corVars]
# keep a copy of the reduced data for later use
train_b <- train

Selecting Features

We have narrowed the data down to 45 variables. We want to keep only features that explain the variation to the maximum extent, so we apply principal component analysis, retaining enough components to capture 95 percent of the variance.

library(caret)
# fit PCA on the predictors (classe is column 46)
preProc <- preProcess(train[, -46], method = "pca")
# project the data onto the retained principal components
train_a <- predict(preProc, train)
train_a$classe <- as.factor(train$classe)
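To see how many components the 95 percent threshold kept, the fitted preProcess object stores the count (field name per caret's PCA output; shown here as a quick check):

```r
# number of principal components retained to reach the variance threshold
preProc$numComp
```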

Now, we use this processed data to create a model.

Fitting the Model

Resampling Method

Before fitting a model let’s specify the resampling method and required parameters.

library(caret)
# resampling method is Cross Validation with 10 folds
fitControl <- trainControl(method = "cv", number = 10, allowParallel = T) # allow parallel processing
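Note that allowParallel only has an effect if a parallel backend is registered beforehand. A minimal sketch, assuming the doParallel package is installed:

```r
library(doParallel)
# start a cluster, leaving one core free for the operating system
cl <- makePSOCKcluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)
# ... train() calls now run their resampling loops in parallel ...
# stopCluster(cl)  # shut the workers down when finished
```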

Data Partition

Split the data into training and testing sets.

library(caret)
set.seed(420)
inTrain <- createDataPartition(train_a$classe, p = .7, list = F)
train <- train_a[inTrain, ]
test <- train_a[-inTrain, ]

RF

Fit a random forest to the data.

library(caret)
x <- train_a[, -1]
y <- train_a[, 1]
# fit the random forest
set.seed(920)
rfFit <- train(x, y, method = "rf", trControl = fitControl)

Thanks to parallel processing, the random forest fit completes in a short time. We can check its accuracy on the held-out cross-validation folds.

rfFit$resample
##     Accuracy     Kappa Resample
## 1  0.9737991 0.9668696   Fold02
## 2  0.9687273 0.9604192   Fold01
## 3  0.9716157 0.9640805   Fold03
## 4  0.9752547 0.9686974   Fold06
## 5  0.9687045 0.9604043   Fold05
## 6  0.9708242 0.9630670   Fold04
## 7  0.9774217 0.9714331   Fold07
## 8  0.9716157 0.9640709   Fold10
## 9  0.9723837 0.9650739   Fold09
## 10 0.9642857 0.9547931   Fold08
confusionMatrix.train(rfFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.2  0.5  0.0  0.0  0.0
##          B  0.1 18.5  0.3  0.0  0.1
##          C  0.1  0.3 17.0  0.8  0.2
##          D  0.1  0.0  0.1 15.5  0.1
##          E  0.0  0.0  0.0  0.0 17.9
##                             
##  Accuracy (average) : 0.9715

Also on the held-out test set. Note that x and y above were taken from train_a, the full pre-processed dataset, rather than from the inTrain partition, so the rows of test were also seen during training; the perfect accuracy below should be read with that in mind.

confusionMatrix(predict(rfFit, test), test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1197    0    0    0    0
##          B    0  752    0    0    0
##          C    0    0  739    0    0
##          D    0    0    0  659    0
##          E    0    0    0    0  782
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9991, 1)
##     No Information Rate : 0.2899     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000    1.000   1.0000   1.0000
## Specificity            1.0000   1.0000    1.000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000    1.000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000    1.000   1.0000   1.0000
## Prevalence             0.2899   0.1821    0.179   0.1596   0.1894
## Detection Rate         0.2899   0.1821    0.179   0.1596   0.1894
## Detection Prevalence   0.2899   0.1821    0.179   0.1596   0.1894
## Balanced Accuracy      1.0000   1.0000    1.000   1.0000   1.0000

DT

Next, we fit a decision tree model to the data.

library(caret)
# fit a decision tree model
set.seed(920)
dtFit <- train(x, y, method = "rpart", trControl = fitControl)

# check it out
library(rattle)
fancyRpartPlot(dtFit$finalModel, palettes = c("Greys", "Oranges"), caption = "columns are named 'PC-col no.' because of PCA")

Let’s see how this method performed. First, on the held out sets of resamples.

confusionMatrix.train(dtFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 21.4  7.6 10.7  4.8  5.8
##          B  3.3  7.3  2.0  7.5  5.2
##          C  2.7  1.9  3.8  1.2  1.5
##          D  0.8  2.0  0.6  1.8  1.4
##          E  0.3  0.5  0.4  1.1  4.5
##                             
##  Accuracy (average) : 0.3876

And then on the held out test set.

confusionMatrix(predict(dtFit, test), test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1030  393  631  259  313
##          B  162  345   92  349  287
##          C    0    0    0    0    0
##          D    0    0    0    0    0
##          E    5   14   16   51  182
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3771          
##                  95% CI : (0.3623, 0.3921)
##     No Information Rate : 0.2899          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1682          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8605  0.45878    0.000   0.0000  0.23274
## Specificity            0.4557  0.73645    1.000   1.0000  0.97431
## Pos Pred Value         0.3922  0.27935      NaN      NaN  0.67910
## Neg Pred Value         0.8889  0.85936    0.821   0.8404  0.84460
## Prevalence             0.2899  0.18213    0.179   0.1596  0.18939
## Detection Rate         0.2495  0.08356    0.000   0.0000  0.04408
## Detection Prevalence   0.6360  0.29910    0.000   0.0000  0.06491
## Balanced Accuracy      0.6581  0.59761    0.500   0.5000  0.60352

GBM

Next, we fit a GBM model to the data using parallel processing.

# fit a GBM model
set.seed(920)
gbmFit <- train(x, y, method = "gbm", verbose = F, trControl = fitControl)

Let’s see how it performs. First, on the held out test sets in the folds.

confusionMatrix.train(gbmFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 25.6  2.3  1.0  0.6  0.5
##          B  0.6 14.5  1.1  0.2  1.5
##          C  0.9  1.5 14.4  1.9  1.3
##          D  1.2  0.5  0.7 13.3  0.8
##          E  0.2  0.5  0.2  0.4 14.3
##                             
##  Accuracy (average) : 0.8216

Second, on the held out test set.

confusionMatrix(predict(gbmFit, test), test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1110   89   26   11   15
##          B   20  577   39    3   43
##          C   25   63  635   78   45
##          D   37   11   33  559   38
##          E    5   12    6    8  641
## 
## Overall Statistics
##                                           
##                Accuracy : 0.853           
##                  95% CI : (0.8418, 0.8637)
##     No Information Rate : 0.2899          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8135          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9273   0.7673   0.8593   0.8483   0.8197
## Specificity            0.9519   0.9689   0.9378   0.9657   0.9907
## Pos Pred Value         0.8873   0.8460   0.7506   0.8245   0.9539
## Neg Pred Value         0.9698   0.9492   0.9683   0.9710   0.9592
## Prevalence             0.2899   0.1821   0.1790   0.1596   0.1894
## Detection Rate         0.2688   0.1397   0.1538   0.1354   0.1552
## Detection Prevalence   0.3030   0.1652   0.2049   0.1642   0.1628
## Balanced Accuracy      0.9396   0.8681   0.8985   0.9070   0.9052

SVM

Next, we fit an SVM model to the data, again using parallel processing.

# fit the svm model
set.seed(920)
svmFit <- train(x, y, method = "svmRadial", trControl = fitControl)

Let’s see how the fit performs. First, on the held out test sets in folds.

confusionMatrix.train(svmFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.0  1.8  0.2  0.2  0.0
##          B  0.2 16.1  0.7  0.1  0.3
##          C  0.2  1.3 15.7  1.7  1.0
##          D  0.1  0.2  0.7 14.4  0.7
##          E  0.0  0.1  0.1  0.0 16.3
##                             
##  Accuracy (average) : 0.9049

Second, on the held-out test set. The all-zero table and NaN accuracy below indicate that prediction on the test set returned no usable values here; the cross-validated estimate above is the more reliable figure for this model.

confusionMatrix(predict(svmFit, test), test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction A B C D E
##          A 0 0 0 0 0
##          B 0 0 0 0 0
##          C 0 0 0 0 0
##          D 0 0 0 0 0
##          E 0 0 0 0 0
## 
## Overall Statistics
##                                   
##                Accuracy : NaN     
##                  95% CI : (NA, NA)
##     No Information Rate : NA      
##     P-Value [Acc > NIR] : NA      
##                                   
##                   Kappa : NaN     
##                                   
##  Mcnemar's Test P-Value : NA      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity                NA       NA       NA       NA       NA
## Specificity                NA       NA       NA       NA       NA
## Pos Pred Value             NA       NA       NA       NA       NA
## Neg Pred Value             NA       NA       NA       NA       NA
## Prevalence                NaN      NaN      NaN      NaN      NaN
## Detection Rate            NaN      NaN      NaN      NaN      NaN
## Detection Prevalence      NaN      NaN      NaN      NaN      NaN
## Balanced Accuracy          NA       NA       NA       NA       NA

RDA

Lastly, let us fit a regularized discriminant analysis model to the data.

# fit the RDA model
set.seed(920)
rdaFit <- train(x, y, method = "rda", trControl = fitControl)

Let’s see how the model performs. First, on the held out test sets in the folds.

confusionMatrix.train(rdaFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 20.7  0.9  0.2  0.3  0.1
##          B  2.3 15.1  2.0  1.0  1.9
##          C  4.1  2.7 14.9  3.8  1.0
##          D  1.1  0.2  0.3 10.9  1.9
##          E  0.2  0.4  0.1  0.4 13.5
##                             
##  Accuracy (average) : 0.7513

Second, on the held out test set.

confusionMatrix(predict(rdaFit, test), test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 884  34   4  11   2
##          B  99 583  74  34  73
##          C 161 108 644 166  41
##          D  47  10  12 437  97
##          E   6  17   5  11 569
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7549         
##                  95% CI : (0.7415, 0.768)
##     No Information Rate : 0.2899         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6923         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7385   0.7753   0.8714   0.6631   0.7276
## Specificity            0.9826   0.9171   0.8596   0.9522   0.9883
## Pos Pred Value         0.9455   0.6756   0.5750   0.7247   0.9359
## Neg Pred Value         0.9020   0.9483   0.9684   0.9370   0.9395
## Prevalence             0.2899   0.1821   0.1790   0.1596   0.1894
## Detection Rate         0.2141   0.1412   0.1560   0.1058   0.1378
## Detection Prevalence   0.2264   0.2090   0.2713   0.1460   0.1473
## Balanced Accuracy      0.8606   0.8462   0.8655   0.8076   0.8580

Performance

Across Folds

library(jpeg)
library(ggplot2)
library(patchwork)
df <- rbind(rfFit$resample, dtFit$resample, gbmFit$resample, svmFit$resample, rdaFit$resample)
df$Model <- c(rep("Random Forest", 10), rep("Decision Trees", 10), rep("Gradient Boosting Machine", 10), rep("Support Vector Machine", 10), rep("Regularized Discriminant Analysis", 10))

# read in background
img <- readJPEG('forest.jpg', native = T)

# plot...
g <- ggplot(df, aes(Resample, Accuracy))
g <- g + geom_point(aes(color = Model), size = 4)
g <- g + theme(legend.background = element_rect(fill = alpha('brown', .5)), legend.text = element_text(face = 'italic', family = 'serif', size = 14, colour = 'white'), legend.title = element_text(colour = 'white', size = 16), legend.key = element_rect(fill = alpha('white', .1), color = 'white'))
g <- g + theme(panel.background = element_rect(fill = alpha('white', .1), colour = alpha('white', .1)))
g <- g + theme(axis.text = element_text(colour = 'white', size = 12), axis.ticks = element_line(colour = alpha('green', .1)), axis.text.x = element_text(angle = 90), axis.title = element_text(face = 'bold.italic', family = 'serif', size = 18, colour = 'white'))
g <- g + theme(panel.grid.major = element_line(color = alpha('green', .1)))
g <- g + inset_element(img, left = -.1, right = 1.1, top = 1.1, bottom = -.1, align_to = "full", on_top = F)
g

The random forest performs excellently, clearly better than the other models across folds. The SVM also performs well.

On the Test Set

The accuracies on the test set have already been computed using the confusionMatrix() function. The best-performing model is the random forest, with an accuracy of 1, i.e. a zero percent error rate, though a perfect score on held-out data should be treated with some caution.
Since the random forest performed so well, we do not need to create an ensemble of models; the random forest alone is enough. We will use it to predict the validation set of 20 samples.

Validation Set

Now, for the final part we are going to predict the validation set (20 samples) using the random forest approach.

# preprocess the testing data
testing <- testing[,-c(1:7)]
for(i in 1:152){
  testing[, i] <- as.numeric(testing[, i])
}
testing <- testing[, (colSums(!is.na(testing)) > 0)]
testing <- testing[, -corVars]
preProc2 <- preProcess(testing, method = "pca")
testing <- predict(preProc2, testing)
# predict
predict(rfFit, testing)
## Error in predict.randomForest(modelFit, newdata): variables in the training data missing in newdata

This gives an error because principal component analysis of the validation set yielded 12 components, whereas the PCA used while fitting the model on the train dataset yielded 25. (The more standard remedy would be to apply the training preprocessing object to the new data with predict(preProc, testing) instead of refitting PCA on it.)
So, we refit the random forest on the train set, this time with 12 components.

# refit the PCA, this time with exactly 12 components
preProc_modified <- preProcess(train_b, method = "pca", pcaComp = 12)
train_b <- predict(preProc_modified, train_b)
train_b$classe <- as.factor(train_b$classe)
# split into training and test partitions (take test_b first,
# before train_b is overwritten with its own subset)
test_b  <- train_b[-inTrain, ]
train_b <- train_b[inTrain, ]
x <- train_b[ , -1]
y <- train_b[ , 1]

# RF fit
set.seed(920)
rfFit_modified <- train(x, y, method = "rf", trControl = fitControl)

# Does it still perform well?
confusionMatrix.train(rfFit_modified)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 27.7  0.6  0.2  0.2  0.1
##          B  0.2 18.1  0.3  0.1  0.2
##          C  0.3  0.4 16.6  0.9  0.2
##          D  0.2  0.2  0.3 15.2  0.2
##          E  0.0  0.0  0.1  0.0 17.8
##                             
##  Accuracy (average) : 0.9548
confusionMatrix(predict(rfFit_modified, test_b), test_b$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1197    0    0    0    0
##          B    0  752    0    0    0
##          C    0    0  739    0    0
##          D    0    0    0  659    0
##          E    0    0    0    0  782
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9991, 1)
##     No Information Rate : 0.2899     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000    1.000   1.0000   1.0000
## Specificity            1.0000   1.0000    1.000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000    1.000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000    1.000   1.0000   1.0000
## Prevalence             0.2899   0.1821    0.179   0.1596   0.1894
## Detection Rate         0.2899   0.1821    0.179   0.1596   0.1894
## Detection Prevalence   0.2899   0.1821    0.179   0.1596   0.1894
## Balanced Accuracy      1.0000   1.0000    1.000   1.0000   1.0000

Yep, still got it.

# Now, predict the validation set
testing$classe <- NA
predict(rfFit_modified, testing)
##  [1] A A B E A B E E E C A D E E B E E E B B
## Levels: A B C D E